Someone showed up on #haskell
yesterday, asking how to use regular
expressions. This isn’t a completely straightforward question to
answer. While Haskell’s regexp libraries provide the same
functionality you’ll find in Perl, Python, and Java, they provide a
rich and fairly abstract interface that can be daunting to newcomers.
So let’s fix that, and strip away the abstractions to show how we might actually use regexps in practice. I’m assuming that you’re already familiar with regexps from using them in some other language, and simply want to find your feet in Haskell. The standard library that implements regexps is Text.Regex.Posix. As the name suggests, this wraps the system’s native POSIX extended regexp library.
If you’re coming from a language like Perl, Python, or Java, you may not have encountered POSIX regular expressions before. Perl-style regexps are much more expressive, and hence far more widely used. POSIX regexps look superficially similar to Perl-style regexps, but they’re not as expressive, and they have different matching behaviour. There aren’t any concise online descriptions of the syntactic and semantic differences between the two regexp families, but the Wikipedia entry on regexps will help you to understand some of the differences. The only advantage of the POSIX regexp library is that it’s bundled with GHC; you don’t need to fetch any extra bits to get it to work.
Before we continue, let’s start up GHC’s interpreter, ghci
.
$ ghci
Loading package base ... linking ... done.
To use the Text.Regex.Posix module, we use the :mod
(or simply :m
)
command to load it and bring it into scope.
Prelude> :mod +Text.Regex.Posix
Prelude Text.Regex.Posix>
The simplest way to use regexps is via the “=~
” function, which we
normally use as an infix operator. And, of course, it’s exactly the
same operator that Perl uses for regexp matching.
What’s interesting is that this function is polymorphic in both its arguments and its return type; this is why its documentation can be hard to read if you’re new to the language. Here’s a simplified type signature, to give you the idea.
(=~) :: string -> pattern -> result
(I’ve left out the constraints on these type variables. The real type
signature of =~
is huge, and only obscures early understanding of
what on earth is going on, so I’m not going to reproduce it here.)
You can use either a normal Haskell string or a much more efficient
ByteString
as the pattern
or string
. You can use them in whatever
combination you like; a String for one, a ByteString for the other, or
whatever suits your needs.
If you try to use =~
interactively at the ghci
prompt, ghci
gives a fearsome error message.
> "bar" =~ "(foo|bar)"
<interactive>:1:0:
No instance for (Text.Regex.Base.RegexLike.RegexContext Regex
[Char]
result)
arising from use of `=~' at <interactive>:1:0-19
Possible fix:
add an instance declaration for
(Text.Regex.Base.RegexLike.RegexContext Regex [Char] result)
In the expression: "bar" =~ "(foo|bar)"
In the definition of `it': it = "bar" =~ "(foo|bar)"
Yikes! What’s happened here is that we haven’t given a type for the
result of the =~
operator. Since the result type is polymorphic,
ghci
has no way to infer what type we might actually want. We can
easily fix this, by suffixing the expression with a type signature.
> "bar" =~ "(foo|bar)" :: Bool
True
> "quux" =~ "(foo|bar)" :: Bool
False
By constraining the type of the result to Bool, we get a simple “yes” or “no” answer when we ask if the match has succeeded.
But we can also use String as the result type, which gives us the first string that matches, or an empty string if the match fails.
> let pat = "(foo[a-z]*bar|quux)"
> "foodiebar, fooquuxbar" =~ pat :: String
"foodiebar"
> "nomatch" =~ pat :: String
""
If we use [String], we get a list of every string that matches.
> "foodiebar, fooquuxbar" =~ pat :: String
["foodiebar","fooquuxbar"]
It’s also possible to get more detail about the context in which a match occurred. The 3-tuple in this result consists of the text before a match; the matched text; and the text after the match.
> "before foodiebar after" =~ pat :: (String,String,String)
("before ","foodiebar"," after")
The 4-tuple below adds the text of all subgroups of the match.
> "before foodiebar after" =~ pat :: (String,String,String,[String])
("before ","foodiebar"," after",["foodiebar"])
But wait, there’s more! You can get plain, simple offset information, either singly or in a list.
> :mod +Text.Regex.Base
> "the foodiebar" =~ pat :: (MatchOffset,MatchLength)
(4,9)
> "no match" =~ pat :: [(MatchOffset,MatchLength)]
[]
All of these different result types are instances of the RegexContext type class. By this point, you should have enough examples that you can read the complete documentation for the other RegexContext instances at Text.Regex.Base.Context without it seeming completely overwhelming. There are many more instances, some of which give you a lot of detailed information about a match.
There’s a monadic variant of the =~
operator, called =~~
. You can
use this to perform matches inside a monad, for example as follows.
This binds a
to the number of matches.
> a <- "foo" =~~ "foo":: IO Int
1
You can also use the =~~
operator outside of a monad in some cases.
For example, you can try a match in the Maybe monad to tidily handle
the possibility of a failure.
> "foo" =~~ "bar":: Maybe String
Nothing
> "foo" =~~ "foo" :: Maybe String
Just "foo"
Here’s an interesting pitfall to watch out for. There’s a small chance you could shoot yourself in the foot if you use a list as a RegexContext. Let me show you what I mean. This expression returns a list of all matches, which is what I’d normally expect.
> "foo foo foo" =~ "foo" :: [String]
["foo","foo","foo"]
But this expression, which differs in only one character, runs the match in the list monad! It’s only ever going to return an empty or single-entry list. Eeek!
> "foo foo foo" =~~ "foo" :: [String]
["foo"]
There’s normally no need to compile regular expression patterns. A pattern will be compiled the first time it’s used, and your Haskell runtime should memoise the compiled representation for you.
This is necessarily only a basic introduction to the Haskell regexp API. In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides. When I have time, I’ll talk about these problems in more detail, and what alternatives you can use. Until then, happy regexp matching!
Could you update the Haskell Cookbook with this information? There’s currently nothing there with regard to regular expressions.
http://www.haskell.org/haskellwiki/Cookbook#Regular_expressions
Also, this doesn’t seem to work in ghci 6.4.2:
Prelude Text.Regex.Posix> “foo foo foo” =~ “foo” :: [String]
:1:14: Not in scope: `=~’
Thanks,
Nick
Yep, ghc 6.4.2 is too old. You’ll need 6.6.
Wow, thanks!
I’ve been hoping someone would make a brief regex tutorial for some time now. It seems simple now that you walk through it, but when I first took a swing at it based on the documentation, the type signatures scared me off.
Also confusing to new folks is the lack of a regex replace in the new API.
I made an ad hoc one that is semi-speedy, but dealing with this issue in general might be a good advanced topic.
Example: http://hpaste.org/697
Thanks for the intro, its very good!
Unfortunately regular expressions just don’t work on Hugs, which makes them unuseable for me…
Typo?: “If we use [String], we get a list of every string that matches. > “foodiebar, fooquuxbar” =~ pat :: String”
Err… wow, I was expecting I might get to play the learn-the-local-formatting-syntax-by-trial-and-error game, but that _really_ managed to trash my comment :-). Trying again:
Typo?: “If we use [String], we get a list of every string that matches… “foodiebar, fooquuxbar” =~ pat :: String” I’m pretty sure this should be [String], not String?
Typo?: “It’s only ever going to return an empty single-entry list.” What is an empty single-entry list?
Not sure what I think about the API design where the return type is so magical. Isn’t it hard to remember, say, what exactly the third output from the four-tuple-of-strings-plus-a-list-of-strings overload does? Interesting to read about, though :-).
Very cool! I’m also really interested in the “next time”. I really like these practical Haskell examples, which show that Haskell is ready for “real world” programming too. I’m looking forward to an article about the performance and/or other disadvantages (hint, hint ;)).
Nathaniel –
Thanks for pointing out the typo. I meant “an empty or single-item list”, and I’ve corrected that.
As far as the magic polymorphic return types are concerned, in this instance I find them easy to remember. I put on my “what would I expect this field of the tuple to contain?” thinking cap, and lo, it gives me the right answer.
My Haskell-fu is weak, but is there any particular reason to style the regex lib like this? Why not just separate functions rather than adding a type signature to indicate what you’re looking for in terms of return value?
Hi…I edited the latest versions of regex packages described above. Thanks for the tutorial page.
I am using my time to update the packages at the moment, and will spend time on documentation after that, so this community help is great.
Additional info, in no particular order:
Using “import Text.Regex.Base” helps expose the whole API. This will be re-exported by backends in the next version. In particular I expect that regex-posix can be made to work with GHC-6.4.2 since I originally created it with that compiler.
=~ is just a function that combines ‘makeRegex’ from the RegexMaker class and ‘match’ from the RegexContext class:
(=~) x r =
let make :: RegexMaker Regex CompOption ExecOption a =>
a -> Regex
make = makeRegex
in match (make r) x
[ The explicit type signature is needed to avoid (show . read) type ambiguity. ]
You wrote:
> There’s normally no need to compile regular expression
> patterns. A pattern will be compiled the first time it’s used,
> and your Haskell runtime should memoise the compiled
> representation for you.
Which is wrong. If you use =~ then the regexp is compiled with makeRegex and probably not cached. If you need to cache the compiled form then you need to use makeRegex and hold onto the result and then use match separately (and matchM for =~~).
Andy wrote:
> Why not just separate functions rather than adding a type
> signature to indicate what you’re looking for in terms of
> return value?
The return-type-polymorphic match and matchM of RegexContex are built on top of the RegexLike class. The RegexLike class defined the separate functions you are looking for: matchAll, matchOnce, matchCount, matchTest, matchAllText, and matchOnceText with normally defined return types.
As for the GHC-centric nature….I could not make the fancy polymorphic RegexContext work without various type system extensions. The actual backends work without this high level type class API — I suspect one should be able strip the code for regex-base and regex-posix/pcre/parsec/dfa down a bit and get it working on Hugs, just without the fancy =~ and =~~. This would be best to do once I announce the next release.
In particular, if you look at the type-specific modules such as “Text.Regex.Posix.String” or “Text.Regex.Posix.Bytestring” they all export three functions named compile, regexec and execute. These can be used if you don’t want to go through any of the type classes at all. The Wrap modules like “Text.Regex.Posix.Wrap” expose even more below-type-class level functions, types, and constants.
Neil:
Unfortunately regular expressions just don’t work on Hugs, which makes them unuseable for me…
As someone coming to Haskell from Perl, my first thought was “Unfortunately regular expressions just don’t work on Hugs, which makes Hugs unuseable for me…”
Wow, awesome tutorial!
Coming from Perl, not being able to use regexes was a semi-major hurdle for me; thanks!
(Also, I like the fact that (=~) is so conveniently overloaded, very dwimmy! :))
About giving up on Hugs vs giving up on Regex. What if I don’t WANT to do a compilation using a compiler everytime I want to try something? How serious do I have to be about this language to “play the game” at all?
John: ghci is an interactive interpreter that’s a part of GHC, and runghc is its non-interactive counterpart, so you can certainly avoid the need to compile while not using Hugs.
Thank you for the hint! It’s nice to have such quick turnaround. Also nice to see you’re still “alive” on this thread.
Is it ironic I had to read this tutorial to remember how to use the regex syntax I invented in the first place? hmm….
Thanks for posting — the multiple return types of Haskell regexen are really cool!
Awesome post! It would have taken me…forever…to figure out these regexn otherwise.
I did some tests with =~, and it seems it *does* only compile the regexp once, if you compile using ghc -O. =~ is a very small function, so with optimizations turned on (-O) ghc will inline it. If you look at the core, the Regex is put in a top-level expression, so it will only be compiled once per program invocation.
This tutorial is exactly what I was looking for! Thanks!
Great post, also It would be nice if you have mentioned splitRegex. I noticed that you guys didn’t used this in RWH, pag 305 you did your own split rather than using it, any particular reason ?
Coding regular expressions as strings is great for getting quick work down, but a ghastly experience as the regular expressions grow.
Once one has used regular expressions written out in code as e.g. Bigloo Scheme allows, returning to regex strings is a huge “What could I possibly have been thinking!” recrimination. Sure, regex strings are familiar, but do I WANT to stay familiar with Perl?
The parser alternatives that Haskell offers work well, and express the regex as code, not crammed into a string.
I am unable to get the regular expression matching to work when I use Posix character classes such as [:digit:]. For example, if I use “^[:digit:] +INTRODUCTION”, this does not match the string “1 INTRODUCTION”, but if I use what is supposed to be equivalent, namely “^[0-9] +INTRODUCTION”, this does match the string “1 INTRODUCTION”. Any ideas what I am doing wrong?
Oh I forgot to say that the tutorial was very useful!
=> Peter,
use [[:digit:]] instead of [:digit:]. Common pitfall. Think about what you would have to write if you would enclose two or more character classes in the same range.
“foo foo foo” =~ “foo” :: [String] is really awesome. However it doesn’t work any more in GHC 7.0.3.
OK. I figure out how to do the above in latest GHC 7. Just use
“foo foo foo” =~ “foo” :: [[String]]
What is [[String]] ??? Yes, its the List of String List. Then you can use head/tail etc to get subgroup of matches. This is extremely useful, when you want to use captured parenthesized subexpressions.
“In practice, you will want to avoid the Text.Regex.Posix implementation; it’s terribly slow, and too strict besides.” —- Then the question is: what shall we use???
According to the comparision on
http://www.haskell.org/haskellwiki/Regular_expressions#Overview
http://www.haskell.org/haskellwiki/Regular_expressions#Sample_benchmark
I recommend regex-tdfa, which is very fast and also native implementation in haskell.
http://hackage.haskell.org/package/regex-tdfa
Doesn’t work for me.
haskell>(“22″ =~ “*”) :: Bool
:1:7:
No instances for (RegexMaker Regex CompOption ExecOption [Char],
RegexContext Regex [Char] Bool)
arising from a use of `=~’
Possible fix:
add instance declarations for
(RegexMaker Regex CompOption ExecOption [Char],
RegexContext Regex [Char] Bool)
In the expression: (“22″ =~ “*”) :: Bool
In an equation for `it': it = (“22″ =~ “*”) :: Bool
Great article! You can also use regular expression tester likes http://liveregex.com for realtime regex testing purpose.
Lolz. I tried John’s tool. It’s good regex tester. Thank you,
The tutorial was very interesting to read……thanks to share it….
Great starting tutorial, this language is kinda odd compared to others.